7 Libraries for Python & R




Introduction

If you follow me, you know that this year I started a series called Weekly Digest for Data Science and AI: Python & R, where I highlighted the best libraries, repos, packages, and tools that help us be better data scientists for all kinds of tasks.

The great folks at Heartbeat sponsored a lot of these digests, and they asked me to create a list of the best of the best—those libraries that really changed or improved the way we worked this year (and beyond).

If you want to read the past digests, take a look here:

Weekly Digest for Data Science and AI - Revue
Weekly Digest for Data Science and AI - Personal newsletter of Favio Vázquez... www.getrevue.co
Disclaimer: This list is based on the libraries and packages I reviewed in my personal newsletter. All of them were trending in one way or another among programmers, data scientists, and AI enthusiasts. Some of them were created before 2018, but if they were trending, they could be considered.

Top 7 for Python

7. AdaNet — Fast and flexible AutoML with learning guarantees.

https://github.com/tensorflow/adanet

AdaNet is a lightweight and scalable TensorFlow AutoML framework for training and deploying adaptive neural networks using the AdaNet algorithm [Cortes et al., ICML 2017]. AdaNet combines several learned subnetworks in order to mitigate the complexity inherent in designing effective neural networks.

This package helps you select optimal neural network architectures by implementing an adaptive algorithm that learns a neural architecture as an ensemble of subnetworks.

You'll need to know TensorFlow to use the package, because it implements a TensorFlow Estimator; in return, it simplifies your machine learning programming by encapsulating training, evaluation, prediction, and export for serving.

You can build an ensemble of neural networks, and the library will help you optimize an objective that balances the trade-offs between the ensemble’s performance on the training set and its ability to generalize to unseen data.

Installation

adanet depends on bug fixes and enhancements not present in TensorFlow releases prior to 1.7. You must install or upgrade your TensorFlow package to at least 1.7:

$ pip install "tensorflow>=1.7.0"

Installing from source

To install from source, you'll first need to install bazel by following their installation instructions.

Next clone adanet and cd into its root directory:

$ git clone https://github.com/tensorflow/adanet && cd adanet

From the adanet root directory run the tests:

$ cd adanet
$ bazel test -c opt //...

Once you have verified that everything works well, install adanet as a pip package.

You're now ready to experiment with adanet.

import adanet

Usage

Here you can find two examples of how to use the package:

tensorflow/adanet
Fast and flexible AutoML with learning guarantees. — tensorflow/adanet github.com
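If you just want a quick feel for the API before diving into those examples, here's a minimal sketch using adanet.AutoEnsembleEstimator, which grows an ensemble from a pool of regular canned estimators. The toy data, feature column, and hyperparameters below are mine and purely illustrative (TensorFlow 1.x style):

import adanet
import numpy as np
import tensorflow as tf

# Toy data: two numeric features, binary target
X = np.random.rand(1000, 2).astype(np.float32)
y = (X[:, 0] + X[:, 1] > 1.0).astype(np.int32).reshape(-1, 1)

feature_columns = [tf.feature_column.numeric_column("x", shape=[2])]
head = tf.contrib.estimator.binary_classification_head()

# The ensemble is grown from a pool of candidate estimators sharing the same head
estimator = adanet.AutoEnsembleEstimator(
    head=head,
    candidate_pool=[
        tf.contrib.estimator.LinearEstimator(
            head=head, feature_columns=feature_columns),
        tf.contrib.estimator.DNNEstimator(
            head=head, feature_columns=feature_columns, hidden_units=[32, 16]),
    ],
    max_iteration_steps=500)

train_input_fn = tf.estimator.inputs.numpy_input_fn(
    {"x": X}, y, batch_size=32, num_epochs=None, shuffle=True)
estimator.train(input_fn=train_input_fn, max_steps=2000)

Each AdaNet iteration trains the candidates, keeps whichever one most improves the ensemble's objective, and repeats, so you get the usual Estimator train/evaluate/predict workflow for free.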

You can read more about it in the original blog post:

Introducing AdaNet: Fast and Flexible AutoML with Learning Guarantees
Posted by Charles Weill, Software Engineer, Google AI, NYC Ensemble learning , the art of combining different machine… ai.googleblog.com

6. TPOT — An automated Python machine learning tool that optimizes machine learning pipelines using genetic programming.

https://github.com/EpistasisLab/tpot

Previously I talked about Auto-Keras, a great library for AutoML in the Pythonic world. Well, I have another very interesting tool for that.

The name is TPOT (Tree-based Pipeline Optimization Tool), and it's an amazing library. It's basically a Python automated machine learning tool that optimizes machine learning pipelines using genetic programming.

TPOT can automate a lot of stuff like feature selection, model selection, feature construction, and much more. Luckily, if you're a Python machine learner, TPOT is built on top of Scikit-learn, so all of the code it generates should look familiar.

What it does is automate the most tedious parts of machine learning by intelligently exploring thousands of possible pipelines to find the best one for your data, and then it provides you with the Python code for the best pipeline it found so you can tinker with the pipeline from there.

This is how it works:

For more details, you can read these great articles by Matthew Mayo:

Using AutoML to Generate Machine Learning Pipelines with TPOT
Thus far in this series of posts we have: This post will take a different approach to constructing pipelines. Certainly… www.kdnuggets.com

and by Randy Olson:

TPOT: A Python Tool for Automating Data Science
By Randy Olson, University of Pennsylvania. Machine learning is often touted as: A field of study that gives computers… www.kdnuggets.com

Installation

You actually need to follow some instructions before installing TPOT. Here they are:

Installation — TPOT
Optionally, you can install XGBoost if you would like TPOT to use the eXtreme Gradient Boosting models. XGBoost is… epistasislab.github.io

After that you can just run:


pip install tpot

Examples:

First, let's start with the basic Iris dataset.
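Here's a minimal sketch, close to the example in the TPOT documentation (the small generations and population_size values are just there so it finishes quickly):

from tpot import TPOTClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the Iris data and split it into training and testing sets
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, train_size=0.75, test_size=0.25, random_state=42)

# Let TPOT search for a good pipeline with genetic programming
tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))

# Export the best pipeline it found as plain Python code
tpot.export('tpot_iris_pipeline.py')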

So here we built a very basic TPOT pipeline that searches for the best ML pipeline to predict iris.target, and then we export that pipeline. After that, what we have to do is very simple: open the .py file TPOT generated and you'll see:

import numpy as np

from sklearn.kernel_approximation import RBFSampler
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# NOTE: Make sure that the class is labeled 'class' in the data file
tpot_data = np.recfromcsv('PATH/TO/DATA/FILE', delimiter='COLUMN_SEPARATOR', dtype=np.float64)
features = np.delete(tpot_data.view(np.float64).reshape(tpot_data.size, -1),
                     tpot_data.dtype.names.index('class'), axis=1)
training_features, testing_features, training_classes, testing_classes = \
    train_test_split(features, tpot_data['class'], random_state=42)

exported_pipeline = make_pipeline(
    RBFSampler(gamma=0.8500000000000001),
    DecisionTreeClassifier(criterion="entropy", max_depth=3, min_samples_leaf=4, min_samples_split=9)
)

exported_pipeline.fit(training_features, training_classes)
results = exported_pipeline.predict(testing_features)

And that’s it. You built a classifier for the Iris dataset in a simple but powerful way.

Let's go to the MNIST dataset now.
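Here's a sketch along the same lines, using scikit-learn's digits dataset as a small, convenient stand-in for MNIST (again following the TPOT documentation):

from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Load the handwritten-digits data and split it
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, train_size=0.75, test_size=0.25, random_state=42)

# Same workflow as before: search, score, export
tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_mnist_pipeline.py')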

As you can see, we did the same! Let’s load the .py file you generated again and you’ll see:


import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# NOTE: Make sure that the class is labeled 'class' in the data file
tpot_data = np.recfromcsv('PATH/TO/DATA/FILE', delimiter='COLUMN_SEPARATOR', dtype=np.float64)
features = np.delete(tpot_data.view(np.float64).reshape(tpot_data.size, -1),
                     tpot_data.dtype.names.index('class'), axis=1)
training_features, testing_features, training_classes, testing_classes = \
    train_test_split(features, tpot_data['class'], random_state=42)

exported_pipeline = KNeighborsClassifier(n_neighbors=4, p=2, weights="distance")
exported_pipeline.fit(training_features, training_classes)
results = exported_pipeline.predict(testing_features)

Super easy and fun. Check it out, try it, and please give them a star!


5. SHAP — A unified approach to explain the output of any machine learning model

https://github.com/slundberg/shap

Explaining machine learning models isn’t always easy. Yet it’s so important for a range of business applications. Luckily, there are some great libraries that help us with this task. In many applications, we need to know, understand, or prove how input variables are used in the model, and how they impact final model predictions.

SHAP (SHapley Additive exPlanations) is a unified approach to explain the output of any machine learning model. SHAP connects game theory with local explanations, uniting several previous methods and representing the only possible consistent and locally accurate additive feature attribution method based on expectations.

Installation

SHAP can be installed from PyPI


pip install shap

or conda-forge


conda install -c conda-forge shap

Usage

There are tons of different models and ways to use the package. Here, I'll show one example using DeepExplainer.

Deep SHAP is a high-speed approximation algorithm for SHAP values in deep learning models that builds on a connection with DeepLIFT, as described in the SHAP NIPS paper that you can read here:

[1802.03888] Consistent Individualized Feature Attribution for Tree Ensembles
Abstract: Interpreting predictions from tree ensemble methods such as gradient boosting machines and random forests is… arxiv.org

Here you can see how SHAP can be used to explain the output of a Keras model on the MNIST dataset.
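The original notebook isn't embedded here, so the following is a minimal sketch in the spirit of the DeepExplainer example in the SHAP README; the tiny one-epoch convnet is just a stand-in for whatever trained Keras model you actually want to explain:

import numpy as np
import shap
from keras.datasets import mnist
from keras.layers import Conv2D, Dense, Flatten
from keras.models import Sequential
from keras.utils import to_categorical

# Load MNIST and scale it to [0, 1], keeping the channel dimension
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(-1, 28, 28, 1).astype("float32") / 255
x_test = x_test.reshape(-1, 28, 28, 1).astype("float32") / 255

# A deliberately tiny model -- accuracy is not the point here
model = Sequential([
    Conv2D(8, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    Flatten(),
    Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.fit(x_train, to_categorical(y_train), batch_size=128, epochs=1)

# Use a random sample of the training set as the background distribution
background = x_train[np.random.choice(x_train.shape[0], 100, replace=False)]

# Explain the model's predictions on a few test images
explainer = shap.DeepExplainer(model, background)
shap_values = explainer.shap_values(x_test[:4])

# Plot the per-class attributions for those images
shap.image_plot(shap_values, -x_test[:4])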

You can find more examples here:

slundberg/shap
A unified approach to explain the output of any machine learning model. — slundberg/shap github.com

Take a look. You’ll be surprised :)


4. Optimus — 🚚 Agile Data Science Workflows made easy with Python and Spark.

https://github.com/ironmussa/Optimus

Ok, so full disclosure, this library is like my baby. I’ve been working on it for a long time now, and I’m very happy to show you version 2.

Optimus V2 was created to make data cleaning a breeze. The API was designed to be super easy for newcomers and very familiar for people that come from working with pandas. Optimus expands the Spark DataFrame functionality, adding .rows and .cols attributes.

With Optimus you can clean your data, prepare it, analyze it, create profilers and plots, and perform machine learning and deep learning, all in a distributed fashion, because on the back-end we have Spark, TensorFlow, and Keras.

It's super easy to use. It's like the evolution of pandas, with a bit of dplyr, joined by Keras and Spark. The code you create with Optimus will work on your local machine, and with a simple change of master, it can run on your local cluster or in the cloud.

You will see a lot of interesting functions created to help with every step of the data science cycle.

Optimus is perfect as a companion for an agile methodology for data science because it can help you in almost all the steps of the process, and it can easily connect to other libraries and tools.

If you want to read more about an Agile DS Methodology check this out:

Agile Framework For Creating An ROI-Driven Data Science Practice
Data Science is an amazing field of research that is under active development both from the academia and the industry… www.business-science.io

Installation (pip):


pip install optimuspyspark

Usage:

As one example, you can load data from a url, transform it, and apply some predefined cleaning functions:

from optimus import Optimus

op = Optimus()

# This is a custom function
def func(value, arg):
    return "this was a number"

df = op.load.url("https://raw.githubusercontent.com/ironmussa/Optimus/master/examples/foo.csv")

df\
    .rows.sort("product", "desc")\
    .cols.lower(["firstName", "lastName"])\
    .cols.date_transform("birth", "new_date", "yyyy/MM/dd", "dd-MM-YYYY")\
    .cols.years_between("birth", "years_between", "yyyy/MM/dd")\
    .cols.remove_accents("lastName")\
    .cols.remove_special_chars("lastName")\
    .cols.replace("product", "taaaccoo", "taco")\
    .cols.replace("product", ["piza", "pizzza"], "pizza")\
    .rows.drop(df["id"] < 7)\
    .cols.drop("dummyCol")\
    .cols.rename(str.lower)\
    .cols.apply_by_dtypes("product", func, "string", data_type="integer")\
    .cols.trim("*")\
    .show()

You can transform this:

into this:

Pretty cool, right?

You can do a thousand more things with the library, so please check it out:

Optimus — Data cleansing and exploration made simple
Prepare, process and explore your Big Data with the fastest open source library on the planet using Apache Spark and… www.hioptimus.com

3. spaCy — Industrial-strength Natural Language Processing (NLP) with Python and Cython

https://spacy.io/

From the creators:

spaCy is designed to help you do real work — to build real products, or gather real insights. The library respects your time, and tries to avoid wasting it. It’s easy to install, and its API is simple and productive. We like to think of spaCy as the Ruby on Rails of Natural Language Processing.

spaCy is the best way to prepare text for deep learning. It interoperates seamlessly with TensorFlow, PyTorch, Scikit-learn, Gensim, and the rest of Python’s awesome AI ecosystem. With spaCy, you can easily construct linguistically sophisticated statistical models for a variety of NLP problems.

Installation:

$ pip3 install spacy
$ python3 -m spacy download en

Here, we’re also downloading the English language model. You can find models for German, Spanish, Italian, Portuguese, French, and more here:

Models Overview · spaCy Models Documentation
spaCy is a free open-source library for Natural Language Processing in Python. It features NER, POS tagging, dependency… spacy.io

Here’s an example from the main webpage:

# python -m spacy download en_core_web_sm
import spacy

# Load English tokenizer, tagger, parser, NER and word vectors
nlp = spacy.load('en_core_web_sm')

# Process whole documents
text = (u"When Sebastian Thrun started working on self-driving cars at "
        u"Google in 2007, few people outside of the company took him "
        u"seriously. “I can tell you very senior CEOs of major American "
        u"car companies would shake my hand and turn away because I wasn’t "
        u"worth talking to,” said Thrun, now the co-founder and CEO of "
        u"online higher education startup Udacity, in an interview with "
        u"Recode earlier this week.")
doc = nlp(text)

# Find named entities, phrases and concepts
for entity in doc.ents:
    print(entity.text, entity.label_)

# Determine semantic similarities
doc1 = nlp(u"my fries were super gross")
doc2 = nlp(u"such disgusting fries")
similarity = doc1.similarity(doc2)
print(doc1.text, doc2.text, similarity)

In this example, we first load the small English model (tokenizer, tagger, parser, NER, and word vectors). Then we process some text, print the named entities found, and finally determine the semantic similarity of two phrases. If you run this code, you get this:

Sebastian Thrun PERSON
Google ORG
2007 DATE
American NORP
Thrun PERSON
Recode ORG
earlier this week DATE
my fries were super gross such disgusting fries 0.7139701635071919

Very simple and super useful. There is also a spaCy Universe, where you can find great resources developed with or for spaCy. It includes standalone packages, plugins, extensions, educational materials, operational utilities, and bindings for other languages:

Universe · spaCy
This section collects the many great resources developed with or for spaCy. It includes standalone packages, plugins… spacy.io

By the way, the usage page is great, with very good explanations and code:

Install spaCy · spaCy Usage Documentation
spaCy is a free open-source library for Natural Language Processing in Python. It features NER, POS tagging, dependency… spacy.io

Take a look at the visualizers page. Awesome features, here:

Visualizers · spaCy Usage Documentation
spaCy is a free open-source library for Natural Language Processing in Python. It features NER, POS tagging, dependency… spacy.io

2. jupytext — Jupyter notebooks as Markdown Documents, Julia, Python or R scripts

For me, this is one of the packages of the year. It’s such an important part of what we do as data scientists. Almost all of us work in notebooks like Jupyter, but we also use IDEs like PyCharm for more hardcore parts of our projects.

The good news is that plain scripts, which you can draft and test in your favorite IDE, open transparently as notebooks in Jupyter when using Jupytext. Run the notebook in Jupyter to generate the outputs, associate an .ipynb representation, and save and share your research as either a plain script or as a traditional Jupyter notebook with outputs.
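For instance, here's what a plain .py script written in Jupytext's "percent" format (one of several formats it understands) can look like; once Jupytext is configured as described below, a file like this opens directly as a notebook, while remaining an ordinary Python script. The contents are just a made-up example:

# %% [markdown]
# # A tiny analysis
# Markdown cells are stored as comments in the script.

# %%
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

# %%
# Each "# %%" marker starts a new notebook cell
y = np.sin(x)
print(y.mean())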

You can see a workflow of what you can do with the package in the gif below:

Installation

Install Jupytext with:

pip install jupytext --upgrade

Then, configure Jupyter to use Jupytext:

  • generate a Jupyter config, if you don’t have one yet, with jupyter notebook --generate-config
  • edit .jupyter/jupyter_notebook_config.py and append the following:
c.NotebookApp.contents_manager_class = "jupytext.TextFileContentsManager"
  • and restart Jupyter, i.e. run:
jupyter notebook

You can give it a try here:

Binder (beta)
https://mybinder.org/v2/gh/mwouts/jupytext/master?filepath=demo mybinder.org

1. Chartify — Python library that makes it easy for data scientists to create charts.

https://xkcd.com/1945/

This, for me, is the winner of the year, for Python. If you are in the Python world, most likely you waste a lot of your time trying to create a decent plot. Luckily, we have libraries like Seaborn that make our life easier. But the issue is that their plots are not dynamic.

Then you have Bokeh—an amazing library—but creating interactive plots with it can be a pain in the a**. If you want to know more about Bokeh and interactive plots for Data Science, take a look at these great articles by William Koehrsen:

Data Visualization with Bokeh in Python, Part I: Getting Started
Elevate your visualization game towardsdatascience.com
Data Visualization with Bokeh in Python, Part II: Interactions
Moving beyond static plots towardsdatascience.com
Data Visualization with Bokeh in Python, Part III: Making a Complete Dashboard
Creating an interactive visualization application in Bokeh towardsdatascience.com

Chartify is built on top of Bokeh, but it's also so much simpler.

From the authors:

Why use Chartify?

  • Consistent input data format: Spend less time transforming data to get your charts to work. All plotting functions use a consistent tidy input data format.
  • Smart default styles: Create pretty charts with very little customization required.
  • Simple API: We've attempted to make the API as intuitive and easy to learn as possible.
  • Flexibility: Chartify is built on top of Bokeh, so if you do need more control you can always fall back on Bokeh's API.

Installation

  1. Chartify can be installed via pip:

pip3 install chartify

2. Install chromedriver requirement (Optional. Needed for PNG output):

  • Install Google Chrome.
  • Download the appropriate version of chromedriver for your OS here.
  • Copy the executable file to a directory within your PATH.
  • View directories in your PATH variable: echo $PATH
  • Copy chromedriver to the appropriate directory, e.g.: cp chromedriver /usr/local/bin

Usage

Let’s say we want to create this chart:

import pandas as pd
import chartify
# Generate example data
data = chartify.examples.example_data()

Now that we have some example data loaded let’s do some transformations:

total_quantity_by_month_and_fruit = (data.groupby(
        [data['date'] + pd.offsets.MonthBegin(-1), 'fruit'])['quantity'].sum()
    .reset_index().rename(columns={'date': 'month'})
    .sort_values('month'))
print(total_quantity_by_month_and_fruit.head())

       month   fruit  quantity
0 2017-01-01   Apple         7
1 2017-01-01  Banana         6
2 2017-01-01   Grape         1
3 2017-01-01  Orange         2
4 2017-02-01   Apple         8

And now we can plot it:

# Plot the data
ch = chartify.Chart(blank_labels=True, x_axis_type='datetime')
ch.set_title("Stacked area")
ch.set_subtitle("Represent changes in distribution.")
ch.plot.area(
    data_frame=total_quantity_by_month_and_fruit,
    x_column='month',
    y_column='quantity',
    color_column='fruit',
    stacked=True)
ch.show('png')

Super easy to create a plot, and it’s interactive. If you want more examples to create stuff like this:

And more, check the original repo:

spotify/chartify
Python library that makes it easy for data scientists to create charts. — spotify/chartify github.com

Top 7 for R

7. infer — An R package for tidyverse-friendly statistical inference

https://github.com/tidymodels/infer

Inference, or statistical inference, is the process of using data analysis to deduce properties of an underlying probability distribution.

The objective of this package is to perform statistical inference using an expressive statistical grammar that coheres with the tidyverse design framework.

Installation

To install the current stable version of infer from CRAN:

install.packages("infer")

Usage

Let’s try a simple example on the mtcars dataset to see what the library can do for us.

First, let's overwrite mtcars so that the variables cyl, vs, am, gear, and carb are factors.


library(infer)
library(dplyr)

mtcars <- mtcars %>%
  mutate(cyl = factor(cyl),
         vs = factor(vs),
         am = factor(am),
         gear = factor(gear),
         carb = factor(carb))

# For reproducibility
set.seed(2018)

We’ll try hypothesis testing. Here, a hypothesis is proposed so that it’s testable on the basis of observing a process that’s modeled via a set of random variables. Normally, two statistical data sets are compared, or a data set obtained by sampling is compared against a synthetic data set from an idealized model.


mtcars %>%
  specify(response = mpg) %>%            # formula alternative: mpg ~ NULL
  hypothesize(null = "point", med = 26) %>%
  generate(reps = 100, type = "bootstrap") %>%
  calculate(stat = "median")

Here, we first specify the response variable, then declare a null hypothesis. After that, we generate resamples using the bootstrap, and finally we calculate the median of each resample. The result of that code is:


## # A tibble: 100 x 2
##    replicate  stat
##        <int> <dbl>
##  1         1  26.6
##  2         2  25.1
##  3         3  25.2
##  4         4  24.7
##  5         5  24.6
##  6         6  25.8
##  7         7  24.7
##  8         8  25.6
##  9         9  25.0
## 10        10  25.1
## # ... with 90 more rows

One of the greatest parts of this library is the visualize function. This will allow you to visualize the distribution of the simulation-based inferential statistics or the theoretical distribution (or both). For an example, let’s use the flights data set. First, let’s do some data preparation:


library(nycflights13)
library(dplyr)
library(ggplot2)
library(stringr)
library(infer)

set.seed(2017)

fli_small <- flights %>%
  na.omit() %>%
  sample_n(size = 500) %>%
  mutate(season = case_when(
    month %in% c(10:12, 1:3) ~ "winter",
    month %in% c(4:9) ~ "summer"
  )) %>%
  mutate(day_hour = case_when(
    between(hour, 1, 12) ~ "morning",
    between(hour, 13, 24) ~ "not morning"
  )) %>%
  select(arr_delay, dep_delay, season,
         day_hour, origin, carrier)

And now we can run a randomization approach to the χ²-statistic:


# obs_chisq is the observed statistic (same pipeline, but with no generate step)
obs_chisq <- fli_small %>%
  specify(origin ~ season) %>%
  hypothesize(null = "independence") %>%
  calculate(stat = "Chisq")
chisq_null_distn <- fli_small %>%
  specify(origin ~ season) %>%          # alt: response = origin, explanatory = season
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "Chisq")
chisq_null_distn %>% visualize(obs_stat = obs_chisq, direction = "greater")

6. janitor — simple tools for data cleaning in R

https://github.com/sfirke/janitor

Data cleansing is a topic very close to me. I’ve been working with my team at Iron-AI to create a tool for Python called Optimus. You can see more about it here:

Data cleansing and exploration with Python and Apache Spark — Big Data and Data Science — Optimus
The group of BBVA Data & Analytics in Mexico has been using Optimus for the past months and we have boosted our… hioptimus.com

But this tool I’m showing you is a very cool package with simple functions for data cleaning.

It has three main functions:

  • perfectly format data.frame column names;
  • create and format frequency tables of one, two, or three variables (think an improved table()); and
  • isolate partially-duplicate records.

Oh, and it's a tidyverse-oriented package. Specifically, it works nicely with the %>% pipe and is optimized for cleaning data brought in with the readr and readxl packages.

Installation

install.packages("janitor")

Usage

I'm using the example from the repo, and the data file dirty_data.xlsx.

library(pacman)  # for loading packages
p_load(readxl, janitor, dplyr, here)

# dirty_data.xlsx is available at http://github.com/sfirke/janitor
roster_raw <- read_excel(here("dirty_data.xlsx"))

glimpse(roster_raw)
#> Observations: 13
#> Variables: 11
#> $ `First Name` <chr> "Jason", "Jason", "Alicia", "Ada", "Desus", "Chien-Shiung", "Chien-Shiung", N...
#> $ `Last Name` <chr> "Bourne", "Bourne", "Keys", "Lovelace", "Nice", "Wu", "Wu", NA, "Joyce", "Lam...
#> $ `Employee Status` <chr> "Teacher", "Teacher", "Teacher", "Teacher", "Administration", "Teacher", "Tea...
#> $ Subject <chr> "PE", "Drafting", "Music", NA, "Dean", "Physics", "Chemistry", NA, "English",...
#> $ `Hire Date` <dbl> 39690, 39690, 37118, 27515, 41431, 11037, 11037, NA, 32994, 27919, 42221, 347...
#> $ `% Allocated` <dbl> 0.75, 0.25, 1.00, 1.00, 1.00, 0.50, 0.50, NA, 0.50, 0.50, NA, NA, 0.80
#> $ `Full time?` <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", NA, "No", "No", "No", "No", ...
#> $ `do not edit! --->` <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
#> $ Certification <chr> "Physical ed", "Physical ed", "Instr. music", "PENDING", "PENDING", "Science ...
#> $ Certification__1 <chr> "Theater", "Theater", "Vocal music", "Computers", NA, "Physics", "Physics", N...
#> $ Certification__2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA

With this:

roster <- roster_raw %>%
  clean_names() %>%
  remove_empty(c("rows", "cols")) %>%
  mutate(hire_date = excel_numeric_to_date(hire_date),
         cert = coalesce(certification, certification_1)) %>%  # from dplyr
  select(-certification, -certification_1)  # drop unwanted columns

The clean_names() function standardizes the column names. Then we remove the empty rows and columns, and with dplyr we convert the Excel serial numbers in hire_date into proper dates, create a new cert column by coalescing certification and certification_1, and finally drop those two original columns.

And with this piece of code…

roster %>% get_dupes(first_name, last_name)

we can find duplicated records that have the same name and last name.

The package also introduces the tabyl function that tabulates the data, like table but pipe-able, data.frame-based, and fully featured. For example:

roster %>%
  tabyl(subject)

#>     subject n    percent valid_percent
#>  Basketball 1 0.08333333           0.1
#>   Chemistry 1 0.08333333           0.1
#>        Dean 1 0.08333333           0.1
#>    Drafting 1 0.08333333           0.1
#>     English 2 0.16666667           0.2
#>       Music 1 0.08333333           0.1
#>          PE 1 0.08333333           0.1
#>     Physics 1 0.08333333           0.1
#>     Science 1 0.08333333           0.1
#>        <NA> 2 0.16666667            NA

You can do a lot more things with the package, so visit their site and give them some love :)


5. Esquisse — RStudio add-in to make plots with ggplot2

https://github.com/dreamRs/esquisse

This add-in allows you to interactively explore your data by visualizing it with the ggplot2 package. It allows you to draw bar graphs, curves, scatter plots, and histograms, and then export the graph or retrieve the code generating the graph.

Installation

Install from CRAN with:

# From CRAN
install.packages("esquisse")

Usage

Then launch the add-in via the RStudio menu. If you don't have a data.frame in your environment, the datasets from ggplot2 are used.

ggplot2 builder addin

Launch the add-in via the RStudio menu or with:

esquisse::esquisser()

The first step is to choose a data.frame:

Or you can use a dataset directly with:

esquisse::esquisser(data = iris)

After that, you can drag and drop variables to create a plot:

You can find information about the package and sub-menus in the original repo:

dreamRs/esquisse
RStudio add-in to make plots with ggplot2. Contribute to dreamRs/esquisse development by creating an account on GitHub. github.com

4. DataExplorer — Automate data exploration and treatment

https://github.com/boxuancui/DataExplorer

Exploratory Data Analysis (EDA) is an initial and important phase of data analysis/predictive modeling. During this process, analysts/modelers will have a first look of the data, and thus generate relevant hypotheses and decide next steps. However, the EDA process can be a hassle at times. This R package aims to automate most of data handling and visualization, so that users could focus on studying the data and extracting insights.

Installation

The package can be installed directly from CRAN.

install.packages("DataExplorer")

Usage

With the package you can create reports, plots, and tables like this:


## Plot basic description for airquality data
plot_intro(airquality)

## View missing value distribution for airquality data
plot_missing(airquality)

## Left: frequency distribution of all discrete variables
plot_bar(diamonds)
## Right: `price` distribution of all discrete variables
plot_bar(diamonds, with = "price")

## View histogram of all continuous variables
plot_histogram(diamonds)
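There's also a single entry point that bundles most of these views (and more) into one HTML report; a minimal call, relying on the package's default output file, looks like this:

## Generate a full EDA report for the airquality data
create_report(airquality)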

You can find much more like this on the package’s official webpage:

Automate data exploration and treatment
Automated data exploration process for analytical tasks and predictive modeling, so that users could focus on… boxuancui.github.io

And in this vignette:

Introduction to DataExplorer
This document introduces the package DataExplorer, and shows how it can help you with different tasks throughout your… boxuancui.github.io

3. Sparklyr — R interface for Apache Spark

https://github.com/rstudio/sparklyr

Sparklyr will allow you to:

  • Connect to Spark from R. The sparklyr package provides a
    complete dplyr backend.
  • Filter and aggregate Spark datasets, and then bring them into R for
    analysis and visualization.
  • Use Spark’s distributed machine learning library from R.
  • Create extensions that call the full Spark API and provide
    interfaces to Spark packages.

Installation

You can install the Sparklyr package from CRAN as follows:


install.packages("sparklyr")

You should also install a local version of Spark for development purposes:



library(sparklyr)
spark_install(version = "2.3.1")

Usage

The first part of using Spark is always creating a context and connecting to a local or remote cluster.

Here we’ll connect to a local instance of Spark via the spark_connect function:

library(sparklyr)
sc <- spark_connect(master = "local")

Using sparklyr with dplyr and ggplot2

We’ll start by copying some datasets from R into the Spark cluster (note that you may need to install the nycflights13 and Lahman packages in order to execute this code):

install.packages(c("nycflights13", "Lahman"))
library(dplyr)
iris_tbl <- copy_to(sc, iris)
flights_tbl <- copy_to(sc, nycflights13::flights, "flights")
batting_tbl <- copy_to(sc, Lahman::Batting, "batting")
src_tbls(sc)

## [1] "batting" "flights" "iris"

To start with, here’s a simple filtering example:

# filter by departure delay and print the first few records
flights_tbl %>% filter(dep_delay == 2)

## # Source:   lazy query [?? x 19]
## # Database: spark_connection
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      542            540         2      923
##  3  2013     1     1      702            700         2     1058
##  4  2013     1     1      715            713         2      911
##  5  2013     1     1      752            750         2     1025
##  6  2013     1     1      917            915         2     1206
##  7  2013     1     1      932            930         2     1219
##  8  2013     1     1     1028           1026         2     1350
##  9  2013     1     1     1042           1040         2     1325
## 10  2013     1     1     1231           1229         2     1523
## # ... with more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Let’s plot the data on flight delays:

delay <- flights_tbl %>%
  group_by(tailnum) %>%
  summarise(count = n(), dist = mean(distance), delay = mean(arr_delay)) %>%
  filter(count > 20, dist < 2000, !is.na(delay)) %>%
  collect()

# plot delays
library(ggplot2)
ggplot(delay, aes(dist, delay)) +
  geom_point(aes(size = count), alpha = 1/2) +
  geom_smooth() +
  scale_size_area(max_size = 2)

## `geom_smooth()` using method = 'gam'

Machine Learning with Sparklyr

You can orchestrate machine learning algorithms in a Spark cluster via the machine learning functions within Sparklyr. These functions connect to a set of high-level APIs built on top of DataFrames that help you create and tune machine learning workflows.

Here's an example where we use ml_linear_regression to fit a linear regression model. We'll use the built-in mtcars dataset to see if we can predict a car's fuel consumption (mpg) based on its weight (wt) and the number of cylinders the engine contains (cyl). We'll assume in each case that the relationship between mpg and each of our features is linear.



# copy mtcars into spark
mtcars_tbl <- copy_to(sc, mtcars)

# transform our data set, and then partition into 'training', 'test'
partitions <- mtcars_tbl %>%
  filter(hp >= 100) %>%
  mutate(cyl8 = cyl == 8) %>%
  sdf_partition(training = 0.5, test = 0.5, seed = 1099)

# fit a linear model to the training dataset
fit <- partitions$training %>%
  ml_linear_regression(response = "mpg", features = c("wt", "cyl"))
fit

## Call: ml_linear_regression.tbl_spark(., response = "mpg", features = c("wt", "cyl"))
##
## Formula: mpg ~ wt + cyl
##
## Coefficients:
## (Intercept)          wt         cyl
##   33.499452   -2.818463   -0.923187

For linear regression models produced by Spark, we can use summary() to learn a bit more about the quality of our fit and the statistical significance of each of our predictors.


summary(fit)

## Call: ml_linear_regression.tbl_spark(., response = "mpg", features = c("wt", "cyl"))
##
## Deviance Residuals:
##    Min     1Q Median     3Q    Max
## -1.752 -1.134 -0.499  1.296  2.282
##
## Coefficients:
## (Intercept)          wt         cyl
##   33.499452   -2.818463   -0.923187
##
## R-Squared: 0.8274
## Root Mean Squared Error: 1.422

Spark machine learning supports a wide array of algorithms and feature transformations, and as illustrated above, it’s easy to chain these functions together with dplyr pipelines.

Check out more about machine learning with sparklyr here:

sparklyr
An R interface to Spark spark.rstudio.com

And more information in general about the package and examples here:

sparklyr
An R interface to Spark spark.rstudio.com

2. Drake — An R-focused pipeline toolkit for reproducibility and high-performance computing

Drake programming

Nope, just kidding. But the name of the package is drake!

https://github.com/ropensci/drake

This is such an amazing package. I’ll create a separate post with more details about it, so wait for that!

Drake is a package created as a general-purpose workflow manager for data-driven tasks. It rebuilds intermediate data objects when their dependencies change, and it skips work when the results are already up to date.

Also, not every run-through starts from scratch, and completed workflows have tangible evidence of reproducibility.

Reproducibility, good management, and tracking experiments are all necessary for easily testing others’ work and analysis. It’s a huge deal in Data Science, and you can read more about it here:

From Zach Scott :

Data Science’s Reproducibility Crisis
What is Reproducibility in Data Science and Why Should We Care? towardsdatascience.com
Toward Reproducibility: Balancing Privacy and Publication
Can there ever be a Goldilocks option in the conflict between data security and research disclosure? towardsdatascience.com

And in an article by me :)

Manage your Machine Learning Lifecycle with MLflow — Part 1.
Reproducibility, good management and tracking experiments is necessary for making easy to test other’s work and… towardsdatascience.com

With drake, you can automatically:

  1. Launch the parts that changed since last time.
  2. Skip the rest.

Installation



# Install the latest stable release from CRAN.
install.packages("drake")

# Alternatively, install the development version from GitHub.
install.packages("devtools")
library(devtools)
install_github("ropensci/drake")

There are some known errors when installing from CRAN. For more on these errors, visit:

The drake R Package User Manual
The drake R Package User Manual ropenscilabs.github.io

I ran into an error myself, so for now I recommend installing the package from GitHub.

Ok, so let’s reproduce a simple example with a twist:

I added a simple plot to see the linear model within drake's main example. The code below creates a plan for executing your whole project.

First, we read the data. Then we prepare it for analysis, create a simple histogram, calculate the correlation, fit the model, plot the linear model, and finally create an R Markdown report.
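My exact gist isn't embedded here, but a sketch very close to drake's main example (the raw_data.xlsx and report.Rmd files come from that example, which you can fetch with drake_example("main")) looks like this; the two plotting helpers and the correlation target are the extras I mentioned, and their contents are just illustrative:

library(drake)
library(dplyr)
library(ggplot2)

# Helper functions used by the plan (illustrative implementations)
create_hist <- function(data) {
  ggplot(data, aes(x = Petal.Width, fill = Species)) +
    geom_histogram(bins = 20)
}
plot_model <- function(data) {
  ggplot(data, aes(x = Petal.Width, y = Sepal.Width, color = Species)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE)
}

# The plan declares every target and the command that builds it
plan <- drake_plan(
  raw_data = readxl::read_excel(file_in("raw_data.xlsx")),
  data = raw_data %>% mutate(Species = forcats::fct_inorder(Species)),
  hist = create_hist(data),
  correlation = cor(data$Petal.Width, data$Sepal.Width),
  fit = lm(Sepal.Width ~ Petal.Width + Species, data),
  lm_plot = plot_model(data),
  report = rmarkdown::render(
    knitr_in("report.Rmd"),
    output_file = file_out("report.html"),
    quiet = TRUE
  )
)

make(plan)  # builds every target the first time
make(plan)  # nothing changed, so drake skips everything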

The code I used for the final report is here:

If we change some of our functions or analysis, when we execute the plan, drake will know what has changed and will only run those changes. It creates a graph so you can see what’s happening:

Graph for analysis

In RStudio, this graph is interactive, and you can save it as HTML for later analysis.

There are more awesome things that you can do with drake that I’ll show in a future post :)


1. DALEX — Descriptive mAchine Learning EXplanations

https://github.com/pbiecek/DALEX

Explaining machine learning models isn’t always easy. Yet it’s so important for a range of business applications. Luckily, there are some great libraries that help us with this task. For example:

thomasp85/lime
lime — Local Interpretable Model-Agnostic Explanations (R port of original Python package) github.com

(By the way, sometimes a simple visualization with ggplot can help you explain a model. For more on this, check the awesome article below by Matthew Mayo.)

Interpreting Machine Learning Models: An Overview
An article on machine learning interpretation appeared on O’Reilly’s blog back in March, written by Patrick Hall, Wen… www.kdnuggets.com

In many applications, we need to know, understand, or prove how input variables are used in the model, and how they impact final model predictions. DALEX is a set of tools that helps explain how complex models are working.

To install from CRAN, just run:


install.packages("DALEX")
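To give you a flavor of the workflow, here's a minimal sketch using the apartments data that ships with DALEX. The function names follow the 2018-era API (the interface has kept evolving), so treat it as illustrative and check the current docs:

library(DALEX)

# Fit any model -- here a plain linear model on the apartments data
model_lm <- lm(m2.price ~ ., data = apartments)

# Wrap it in an explainer: DALEX just needs the model, validation data,
# and the true values of the target
explainer_lm <- explain(model_lm,
                        data = subset(apartmentsTest, select = -m2.price),
                        y = apartmentsTest$m2.price,
                        label = "linear model")

# Permutation-based variable importance, with a ready-made plot() method
vi_lm <- variable_importance(explainer_lm)
plot(vi_lm)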

They have amazing documentation on how to use DALEX with different ML packages:

Great cheat sheets:

https://github.com/pbiecek/DALEX
https://github.com/pbiecek/DALEX

Here’s an interactive notebook where you can learn more about the package:

Binder (beta)
mybinder.org

And finally, some book-style documentation on DALEX, machine learning, and explainability:

DALEX: Descriptive mAchine Learning EXplanations
Do not trust a black-box model. Unless it explains itself. pbiecek.github.io

Check it out in the original repository:

pbiecek/DALEX
DALEX — Descriptive mAchine Learning EXplanations github.com